
    Haplotype Variety Analysis of Human Populations: an Application to HapMap Data

    We undertake a study to investigate the haplotype variety of distinct human populations. We use a natural measure of haplotype variety, the total number of haplotypes (TNH), which reflects the number of haplotypes with nonzero frequencies estimated from the data at hand for each selection of multiple loci. For the analysis of real human populations, we use the haplotype data of the Denver Chinese, Tuscan Italians, Luhya Kenyans, and Gujarati Indians from release III of the HapMap database. Moreover, we show that the TNH statistic is biased in small-sample data scenarios such as the HapMap and implement a nested simulation study to estimate and remove such bias. We perform a preliminary analysis of the means and variances of the population allele frequencies in the four populations. Lastly, we implement a generalized linear model to detect and quantify the differences in the haplotype structures of these populations. Our results show that all populations possess significantly different adjusted average TNH values. Our findings extend previous results based on alternative statistical approaches and demonstrate the existence of pronounced differences in the haplotype variety of the analyzed populations, even after controlling for haplotype span as well as all allele frequencies and their two-way interactions.
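
    To make the TNH measure concrete, the minimal Python sketch below counts the distinct haplotypes observed over a chosen set of loci and estimates the small-sample undercount by resampling; the array layout, 0/1 allele coding, and resampling scheme are illustrative assumptions, not the authors' nested simulation design.

```python
import numpy as np

def tnh(haplotypes, loci):
    """Total number of haplotypes (TNH): count of distinct haplotypes
    observed (i.e., with nonzero sample frequency) over the selected loci."""
    # haplotypes: 2-D array of shape (n_haplotypes, n_loci) with allele codes
    sub = haplotypes[:, loci]               # restrict to the chosen loci
    return np.unique(sub, axis=0).shape[0]  # distinct rows = distinct haplotypes

def subsample_bias(haplotypes, loci, sample_size, n_rep=1000, rng=None):
    """Rough resampling estimate of the small-sample bias of TNH:
    average TNH in subsamples of the given size (<= panel size)
    minus TNH in the full panel; a negative value is an undercount."""
    rng = np.random.default_rng(rng)
    full = tnh(haplotypes, loci)
    sub_vals = []
    for _ in range(n_rep):
        idx = rng.choice(haplotypes.shape[0], size=sample_size, replace=False)
        sub_vals.append(tnh(haplotypes[idx], loci))
    return np.mean(sub_vals) - full

# Toy data: 6 haplotypes over 4 biallelic loci (0/1 coding)
H = np.array([[0, 1, 0, 1],
              [0, 1, 0, 1],
              [1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 1, 0, 0]])
print(tnh(H, [0, 1, 2, 3]))   # 4 distinct haplotypes
```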

    A Kinship-Based Modification of the Armitage Trend Test to Address Hidden Population Structure and Small Differential Genotyping Errors

    BACKGROUND/AIMS: We propose a modification of the well-known Armitage trend test to address the problems associated with hidden population structure and hidden relatedness in genome-wide case-control association studies. METHODS: The new test adopts beneficial traits from three existing testing strategies: the principal components, mixed model, and genomic control approaches, while avoiding some of their disadvantageous characteristics, such as the tendency of the principal components method to over-correct in certain situations or the failure of the genomic control approach to reorder the adjusted tests based on their degree of alignment with the underlying hidden structure. The new procedure is based on Gauss-Markov estimators derived from a straightforward linear model with an imposed variance structure proportional to an empirical relatedness matrix. Conceptual and analytical similarities to and distinctions from other approaches are emphasized throughout. RESULTS: Our simulations show that the power performance of the proposed test is quite promising compared to the considered competing strategies. The power gains are especially large when small differential genotyping errors between cases and controls are present, a likely scenario when public controls are used in multiple studies. CONCLUSION: The proposed modified approach attains high power more consistently than the existing commonly implemented tests. Its performance improvement is most apparent when small but detectable systematic differences between cases and controls exist.
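
    The core of the described procedure is a Gauss-Markov (generalized least squares) estimator under a covariance proportional to an empirical relatedness matrix. The sketch below is a generic GLS trend test built from that description; the Wald-type statistic and the toy data are assumptions for illustration, not the authors' exact test.

```python
import numpy as np
from scipy import stats

def kinship_adjusted_trend_test(y, g, Phi):
    """GLS version of a trend test: regress case/control status y (0/1) on
    additive genotype g (0/1/2) with Var(y) taken proportional to an
    empirical relatedness matrix Phi; return a Wald-type statistic."""
    n = len(y)
    X = np.column_stack([np.ones(n), g])             # intercept + additive genotype
    Phi_inv = np.linalg.inv(Phi)
    XtPX = X.T @ Phi_inv @ X
    beta = np.linalg.solve(XtPX, X.T @ Phi_inv @ y)  # Gauss-Markov (GLS) estimator
    resid = y - X @ beta
    sigma2 = (resid @ Phi_inv @ resid) / (n - X.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(XtPX)[1, 1])
    z = beta[1] / se
    return z**2, stats.chi2.sf(z**2, df=1)           # statistic and p-value, 1 df

# Toy example: with unrelated individuals (Phi = identity) this reduces to OLS
y = np.array([1, 1, 0, 0, 1])
g = np.array([2, 1, 0, 0, 1])
print(kinship_adjusted_trend_test(y, g, np.eye(5)))
```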

    A Two-Light Version of the Classical Hundred Prisoners and a Light Bulb Problem: Optimizing Experimental Design through Simulations

    We propose five original strategies of successively increasing complexity and efficiency that address a novel version of a classical mathematical problem that, in essence, focuses on the determination of an optimal protocol for exchanging limited amounts of information among a group of subjects with various prerogatives. The inherent intricacy of the problem-solving protocols precludes an analytical solution. Therefore, we implemented a large-scale simulation study to exhaustively search through an extensive list of competing algorithms associated with the five generally defined protocols mentioned above. Our results show that the consecutive improvements in the average amount of time necessary for strategy-specific problem-solving completion over the previous, simpler, and less advantageously structured designs were 18%, 30%, 12%, and 9%, respectively. The optimal multi-stage information exchange strategy allows for successful execution of the task of interest in 1722 days (4.7 years) on average, with a standard deviation of 385 days. The execution of this protocol took as few as 1004 days and as many as 4965 days, with a median of 1616 days.
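
    The abstract does not spell out the five two-light protocols, so the sketch below only illustrates the simulation machinery on the classical one-light, single-counter strategy (a designated counter turns the light off and tallies; every other prisoner turns it on exactly once); the function names and run counts are illustrative.

```python
import random

def simulate_single_counter(n_prisoners=100, seed=None):
    """Classical one-light strategy: prisoner 0 is the designated counter and
    turns the light off, incrementing a tally; every other prisoner turns the
    light on exactly once. Returns the number of days until the counter is
    certain that everyone has visited."""
    rng = random.Random(seed)
    light_on = False
    count = 0                              # distinct non-counter prisoners tallied
    has_signaled = [False] * n_prisoners
    day = 0
    while count < n_prisoners - 1:
        day += 1
        visitor = rng.randrange(n_prisoners)
        if visitor == 0:
            if light_on:
                light_on = False
                count += 1
        elif not has_signaled[visitor] and not light_on:
            light_on = True
            has_signaled[visitor] = True
    return day

# Average completion time over repeated runs, mirroring the paper's use of
# large-scale simulation for strategies that resist analytical treatment
days = [simulate_single_counter(seed=s) for s in range(200)]
print(sum(days) / len(days))   # around 10,400 days for the one-light baseline
```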

    An R package for parametric estimation of causal effects

    This article explains the usage of the R package CausalModels, which is publicly available on the Comprehensive R Archive Network. While packages are available for estimating causal effects, none provides a collection of structural models that follows the conventional statistical approach developed by Hernan and Robins (2020). CausalModels addresses this gap in R software for causal inference by offering tools for methods that account for biases in observational data without requiring extensive statistical knowledge. These methods should not be ignored and may be more appropriate or efficient for solving particular problems. While implementations of these statistical models are distributed among a number of causal packages, CausalModels introduces a simple and accessible framework for a consistent modeling pipeline across a variety of statistical methods for estimating causal effects in a single R package. It consists of common methods including standardization, IP weighting, G-estimation, outcome regression, instrumental variables, and propensity matching.
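
    As a hedged illustration of one of the listed methods rather than the package's own R API (which is not reproduced here), the Python sketch below implements a bare-bones inverse probability weighting estimator of an average treatment effect with statsmodels; the variable names and the toy data-generating process are assumptions.

```python
import numpy as np
import statsmodels.api as sm

def ip_weighted_effect(treatment, outcome, covariates):
    """Minimal IP-weighting estimator: fit a logistic propensity model, weight
    each subject by the inverse probability of the treatment actually received,
    and fit a weighted marginal structural outcome model."""
    X = sm.add_constant(covariates)
    propensity = sm.Logit(treatment, X).fit(disp=0).predict(X)
    weights = np.where(treatment == 1, 1 / propensity, 1 / (1 - propensity))
    design = sm.add_constant(treatment)
    msm = sm.WLS(outcome, design, weights=weights).fit()
    return msm.params[1]        # estimated average treatment effect

# Toy data: confounder z affects both treatment assignment and outcome
rng = np.random.default_rng(1)
z = rng.normal(size=500)
a = rng.binomial(1, 1 / (1 + np.exp(-z)))
y = 2.0 * a + 1.5 * z + rng.normal(size=500)
print(ip_weighted_effect(a, y, z[:, None]))   # should land near the true effect of 2
```

    The sketch only shows what an IP-weighted marginal structural model does conceptually; in CausalModels the same kind of estimand would be obtained through the package's own R functions.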

    On the ranking of the disease susceptibility locus in family-based candidate gene studies: a simulation-based analysis

    The ranking of the p-value of the true causal single nucleotide polymorphism (SNP) in the ordered list of individual SNP p-values is an important factor for achieving success in the ultimate objective of association studies - identifying deleterious genetic variants. Thus, we undertake a study to assess the implications of complex, multimarker correlation structure, sample size, and disease models on the ranking of the causal SNP. We carry out an extensive family-based candidate gene simulation study to analyze the position of the disease susceptibility locus (DSL) in the complete list of individual SNP p-values ordered according to their statistical significance. We simulate data based on the haplotype distributions of ten randomly selected genes extracted from the HapMap database, various sample sizes (600, 1000, and 2000) that current association studies employ, and disease models that mimic the characteristics of complex human disorders. We conclude that the average rankings of the causal SNP for sample sizes 600, 1000, and 2000 (10.97, 9.65, and 8.34, respectively) are dramatically distant from the most significant and intuitively appropriate top position. This result is even more pronounced for genes with high average correlation and a large number of common SNPs. Moreover, the gain in the DSL ranking when comparing sample sizes 600 to 1000 and 1000 to 2000, averaged over disease models, causal SNPs, and genes, was approximately 1.3. These outcomes both reveal the importance of the sample size and quantify the magnitude required to unequivocally determine the identity of the DSL in family-based candidate gene studies. Our results show the overwhelming importance of large sample sizes in the localization of deleterious SNPs even under simple disease models. These conclusions possess pronounced importance for the design and result interpretation of candidate gene and next-generation high-density genome-wide association studies, as well as for the construction and implementation of association tests based on the distribution of the most significant (minimum p-value) test statistics.
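
    The quantity being tracked is simply the position of the causal SNP's p-value in the sorted list of single-SNP p-values. The short sketch below computes that rank over simulated replicates; the uniform and beta p-value generators are placeholders standing in for the paper's family-based disease models and haplotype-based data.

```python
import numpy as np

def causal_rank(p_values, causal_index):
    """Rank (1 = most significant) of the causal SNP's p-value
    within the ordered list of single-SNP p-values."""
    order = np.argsort(p_values)                 # ascending: smallest p first
    return int(np.where(order == causal_index)[0][0]) + 1

# Placeholder replicates: 30 SNPs, the causal SNP at index 7 tends to have a smaller p-value
rng = np.random.default_rng(2)
ranks = []
for _ in range(1000):
    p = rng.uniform(size=30)
    p[7] = rng.beta(0.3, 3.0)                    # stand-in for an enriched association signal
    ranks.append(causal_rank(p, 7))
print(np.mean(ranks))                            # average rank of the causal SNP across replicates
```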

    A Comparative Study on Deep Learning Models for Text Classification of Unstructured Medical Notes with Various Levels of Class Imbalance

    Background Discharge medical notes written by physicians contain important information about the health condition of patients. Many deep learning algorithms have been successfully applied to extract important information from unstructured medical notes data that can entail subsequent actionable results in the medical domain. This study aims to explore the performance of various deep learning algorithms in text classification tasks on medical notes with respect to different disease class imbalance scenarios. Methods In this study, we employed seven artificial intelligence models, a CNN (Convolutional Neural Network), a Transformer encoder, a pretrained BERT (Bidirectional Encoder Representations from Transformers), and four typical sequence neural network models, namely, RNN (Recurrent Neural Network), GRU (Gated Recurrent Unit), LSTM (Long Short-Term Memory), and Bi-LSTM (Bi-directional Long Short-Term Memory), to classify the presence or absence of 16 disease conditions from patients’ discharge summary notes. We analyzed this question as a composition of 16 separate binary classification problems. The performance of the seven models on each of the 16 datasets, which exhibit various levels of imbalance between classes, was compared in terms of AUC-ROC (Area Under the Curve of the Receiver Operating Characteristic), AUC-PR (Area Under the Curve of Precision and Recall), F1 Score, and Balanced Accuracy, as well as training time. The model performances were also compared in combination with different word embedding approaches (GloVe, BioWordVec, and no pre-trained word embeddings). Results The analyses of these 16 binary classification problems showed that the Transformer encoder model performs the best in nearly all scenarios. In addition, when the disease prevalence is close to or greater than 50%, the Convolutional Neural Network model achieved a comparable performance to the Transformer encoder, and its training time was 17.6% shorter than the second fastest model, 91.3% shorter than the Transformer encoder, and 94.7% shorter than the pre-trained BERT-Base model. The BioWordVec embeddings slightly improved the performance of the Bi-LSTM model in most disease prevalence scenarios, while the CNN model performed better without pre-trained word embeddings. In addition, the training time was significantly reduced with the GloVe embeddings for all models. Conclusions For classification tasks on medical notes, Transformer encoders are the best choice if computational resources are not an issue. Otherwise, when the classes are relatively balanced, CNNs are a leading candidate because of their competitive performance and computational efficiency.
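
    A compact sketch of the evaluation side of such a comparison is given below, computing the four reported metrics (AUC-ROC, AUC-PR via average precision, F1, and balanced accuracy) with scikit-learn on toy predictions; the models themselves (CNN, Transformer encoder, BERT, and the recurrent variants) are beyond a few lines and are not reproduced.

```python
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             f1_score, balanced_accuracy_score)

def evaluate_binary(y_true, y_prob, threshold=0.5):
    """Metrics used to compare classifiers across class-imbalance levels:
    AUC-ROC, AUC-PR (average precision), F1, and balanced accuracy."""
    y_pred = [int(p >= threshold) for p in y_prob]
    return {
        "auc_roc": roc_auc_score(y_true, y_prob),
        "auc_pr": average_precision_score(y_true, y_prob),
        "f1": f1_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    }

# Toy predictions for an imbalanced label (3 positives out of 10)
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.05, 0.3, 0.15, 0.4, 0.6, 0.7, 0.55, 0.9]
print(evaluate_binary(y_true, y_prob))
```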

    Pitcher Effectiveness: A Step Forward for In Game Analytics and Pitcher Evaluation

    With the introduction of Statcast in 2015, baseball analytics have become more precise. Statcast allows every play to be accurately tracked, and the data it generates is easily accessible through Baseball Savant, which opens the opportunity to develop improved performance statistics. In this paper we propose a new tool, Pitcher Effectiveness, that uses Statcast data to evaluate starting pitchers dynamically, based on the results of in-game outcomes after each pitch. Pitcher Effectiveness successfully predicts instances where starting pitchers give up several runs, which we believe makes it a new and important tool for the in-game and post-game evaluation of starting pitchers.

    Assessing the Reidentification Risks Posed by Deep Learning Algorithms Applied to ECG Data

    ECG (Electrocardiogram) data analysis is one of the most widely used and important tools in cardiology diagnostics. In recent years, the development of advanced deep learning techniques and GPU hardware has made it possible to train neural network models that attain exceptionally high levels of accuracy in complex tasks such as heart disease diagnosis and treatment. We investigate the use of ECGs as biometrics in human identification systems by implementing state-of-the-art deep learning models. We train convolutional neural network models on approximately 81k patients from the US, Germany, and China. Currently, this is the largest research project on ECG identification. Our models achieved an overall accuracy of 95.69%. Furthermore, we assessed the accuracy of our ECG identification model for distinct groups of patients with particular heart conditions and combinations of such conditions. For example, we observed that the identification accuracy was the highest (99.7%) for patients with both ST changes and supraventricular tachycardia. We also found that the identification rate was the lowest for patients diagnosed with both atrial fibrillation and complete right bundle branch block (49%). We discuss the implications of these findings regarding the reidentification risks of patients based on ECG data and how seemingly anonymized ECG datasets can cause privacy concerns for the patients.
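
    As a rough indication of the kind of model involved, the PyTorch sketch below defines a small 1-D convolutional network over 12-lead ECG segments whose final layer scores candidate identities; the layer sizes, sampling rate, and number of identities are illustrative assumptions, not the architecture used in the study.

```python
import torch
import torch.nn as nn

class ECGIdentifier(nn.Module):
    """Small 1-D CNN over 12-lead ECG segments; the final layer scores
    candidate patient identities (biometric identification framing)."""
    def __init__(self, n_leads=12, n_identities=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_leads, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(128, n_identities)

    def forward(self, x):             # x: (batch, 12 leads, samples)
        z = self.features(x).squeeze(-1)
        return self.classifier(z)     # unnormalized identity scores

# One forward pass on a dummy 10-second, 500 Hz recording (5000 samples)
model = ECGIdentifier()
scores = model(torch.randn(2, 12, 5000))
print(scores.shape)                   # torch.Size([2, 1000])
```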

    On the Analysis of Phylogenetically Paired Designs

    As phylogenetically controlled experimental designs become increasingly common in ecology, the need arises for a standardized statistical treatment of these datasets. Phylogenetically paired designs circumvent the need for resolved phylogenies and have been used to compare species groups, particularly in the areas of invasion biology and adaptation. Despite the widespread use of this approach, the statistical analysis of paired designs has not been critically evaluated. We propose a mixed model approach that includes random effects for pair and species. These random effects introduce a “two-layer” compound symmetry variance structure that captures both the correlations between observations on related species within a pair and the correlations between the repeated measurements within species. We conducted a simulation study to assess the effect of model misspecification on Type I and Type II error rates. We also provide an illustrative example with data containing taxonomically similar species and several outcome variables of interest. We found that a mixed model with species and pair as random effects performed better in these phylogenetically explicit simulations than two commonly used reference models (no or a single random effect) by optimizing Type I error rates and power. The proposed mixed model produces acceptable Type I and Type II error rates despite the absence of a phylogenetic tree. This design can be generalized to a variety of datasets to analyze repeated measurements in clusters of related subjects/species.
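
    A minimal sketch of the described structure, assuming statsmodels and illustrative column names (pair, species, group, y), fits a random intercept for pair plus a variance component for species nested within pair, which is one way to encode the "two-layer" compound symmetry; the simulated data below are placeholders, not the paper's example dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative long-format data: each row is one repeated measurement on a
# species; species come in pairs of close relatives (e.g., invasive vs. native)
rng = np.random.default_rng(3)
rows = []
for pair in range(20):
    pair_eff = rng.normal(scale=1.0)
    for group in ("invasive", "native"):
        species_eff = rng.normal(scale=0.7)
        for rep in range(4):                      # repeated measurements per species
            y = (2.0 + (1.0 if group == "invasive" else 0.0)
                 + pair_eff + species_eff + rng.normal(scale=0.5))
            rows.append({"pair": pair, "species": f"p{pair}_{group}",
                         "group": group, "y": y})
df = pd.DataFrame(rows)

# Random intercept for pair (groups=) plus a variance component for species
# nested within pair (vc_formula=): the two-layer variance structure
model = smf.mixedlm("y ~ group", df, groups=df["pair"],
                    re_formula="1", vc_formula={"species": "0 + C(species)"})
print(model.fit().summary())
```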

    A 12-Lead ECG Database to Identify Origins of Idiopathic Ventricular Arrhythmia Containing 334 Patients

    Cardiac catheter ablation has proven effective in treating idiopathic premature ventricular complexes and ventricular tachycardia. As the most important prerequisite for successful therapy, criteria based on the analysis of 12-lead ECGs are employed to reliably predict the locations of idiopathic ventricular arrhythmia before a subsequent catheter ablation procedure. Among these possible locations, the right ventricular outflow tract and the left ventricular outflow tract are the major ones. We created a new 12-lead ECG database under the auspices of Chapman University and Ningbo First Hospital of Zhejiang University that aims to provide high-quality data enabling the distinction between idiopathic ventricular arrhythmias originating from the right ventricular outflow tract and those originating from the left ventricular outflow tract. The dataset contains 334 subjects who successfully underwent a catheter ablation procedure that validated the accurate origins of their idiopathic ventricular arrhythmia.